Azure: Synapse Analytics w/ DataBricks and DevOps

Andrew Fogarty

05/07/2021

# load python
library(reticulate)
use_python('C:/Users/Andrew/Anaconda3/')
use_condaenv(condaenv='my_ml', required=TRUE)
library(knitr)

1 Introduction

Azure Synapse Analytics is a limitless analytics service that brings together data integration, enterprise data warehousing, and big data analytics. It gives you the freedom to query data on your terms, using either serverless or dedicated resources—at scale. Azure Synapse brings these worlds together with a unified experience to ingest, explore, prepare, manage, and serve data for immediate BI and machine learning needs.

In this demo, we will walk through the necessary procedures to:

1. Create an Azure Synapse environment
2. Inject SQL tables into our Data Lake
3. Connect the Azure Data Lake to Azure Databricks
4. Set up Azure DevOps in Synapse for Git control

2 Create an Azure Synapse Environment

In order to create an Azure Synapse environment, several preparatory objects are needed: a resource group, an Azure Data Lake Storage Gen2 account, and a file system (container) within it.

In the image below, we create the Synapse workspace, storage account, and file system name in one single effort.

3 Inject SQL Tables into our Data Lake

We can now launch Azure Synapse Analytics, which looks like the below image:

To use a modern data warehousing technology like a Data Lake, we need to fill it with data. Data Lakes store data of any size, shape, and speed, and support all types of processing and analytics across platforms and languages. Synapse can fill our Data Lake with data from just about any source we can imagine, but to do that, we must set up connections between the sources of our data (e.g., SQL tables) and the sink (our Data Lake) so that we can copy the data and have our own versions for analysis.

Our first task is to connect to an Azure SQL Database which has the commonly known AdventureWorks data set already loaded. First, we open our Data tab, select Linked, press the + sign, and then select Integration dataset to open up a new connection to our Azure SQL Database from where we want to copy its data into our Data Lake.

Next, we instantiate our connection by creating a new link to the Azure SQL Database by specifying the Azure subscription it is under, the Azure SQL Server name, the database name, the user name, and the password. Lastly, we test the connection to ensure we entered our login credentials appropriately.

Next, we skip importing a schema and leave the Table name unspecified for now.

Next, we want to use dynamic parameterization, which will prove useful later because it lets us easily import many tables into our Azure Data Lake at once. This means we need to create two parameters: (1) TableName and (2) SchemaName.

Now, we need to return to the Connection tab and specify some dynamic changes to our Table entry as follows:
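In the dataset's Table entry, the schema and table boxes are each filled with a dynamic content expression (standard Azure Data Factory expression syntax): @dataset().SchemaName and @dataset().TableName. The Python below is only an illustration of what those expressions resolve to, not Synapse code:

```python
# Illustration only: the parameterized dataset's Table entry resolves to
# <SchemaName>.<TableName>, with each box filled by a dynamic expression:
#   schema box -> @dataset().SchemaName
#   table box  -> @dataset().TableName

def resolve_table(schema_name: str, table_name: str) -> str:
    """Mimic how the parameterized dataset resolves to a concrete table."""
    return f"{schema_name}.{table_name}"

print(resolve_table("SalesLT", "Customer"))  # -> SalesLT.Customer
```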

With that done, we need to create a new linked integration dataset. As above, we open our Data tab, select Linked, press the + sign, and then select Integration dataset to open a new connection to our Azure Data Lake Storage Gen2, and then select Parquet. Again, we do not want to import a schema, but we will specify our root folder path as the file system name we initially created, demolake.

Next, we will create our parameters for our Parquet files: (1) SchemaName and (2) TableName.

Now we need to specify some important Data Lake architecture, which is to place our raw data into a specific RAW folder. In general, we want to have three types of data in our Data Lake: RAW data exactly as ingested, BASE data that has been cleaned and standardized, and curated data that is ready for analysis.

Since we have a clean Data Lake and since we are injecting raw data into it, we need to make some edits to where our eventual Parquet files will land. We do so by selecting the Connection tab and specifying a dynamic file path like so:

When the time comes to copy data into our Data Lake, this places our newly generated Parquet files into the RAW folder, where each schema gets its own base folder and each table gets its own folder inside it.
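As a small sketch of this folder convention (the exact file-name pattern is whatever your dynamic file-path expression specifies; the names below are illustrative):

```python
def raw_parquet_path(schema_name: str, table_name: str) -> str:
    """Build the RAW landing path: one folder per schema, one subfolder per
    table, with the Parquet file named after the schema and table.
    The 'raw' zone name and file-name pattern are this demo's convention."""
    return f"raw/{schema_name}/{table_name}/{schema_name}.{table_name}.parquet"

print(raw_parquet_path("SalesLT", "Customer"))
# raw/SalesLT/Customer/SalesLT.Customer.parquet
```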

Now we must create one more integration dataset, which will provide us with a list of our tables to query, retrieve, and copy into our Data Lake. Again, we select the Data tab, select Linked, press the + sign, and then select Integration dataset to open a new connection to our Azure SQL Database. We give it a new descriptive name, use our previous connection to the SQL Database, and do not parameterize it.

Now we are ready to copy our data with the use of pipelines. To create a new pipeline, we need to select the Integrate tab, press the + sign, and select Pipeline. Like other steps, we give it a descriptive name like For Each Copy, as we are copying the data into our Data Lake for each table we find.

Our next task is to generate the list of tables we want to extract. To do so, we open the General drop-down and drag Lookup into the canvas. We also add a ForEach activity, found under Iteration & conditionals. We give our Lookup an understandable name, Table List, and then select the Settings tab. Here, we choose the Azure SQL Database dataset we previously named sql_aw as the source. Next, and importantly, we select Query and use the following query:

SELECT
    *
FROM
    adventureworks.INFORMATION_SCHEMA.TABLES
WHERE table_type = 'BASE TABLE'

Note that if we wanted to extract just a few of the tables, we could do so by creating a table listing the extractions and saving it as shown below. This command is probably best run from: Azure Home -> SQL databases -> select the database (e.g., adventureworks) -> Query editor -> execute the query below.

CREATE TABLE ExportedTables2 (
    ID int identity(1,1) Primary Key,
    Table_Schema varchar(100) not null,
    Table_Name varchar(100) not null
    );
GO
INSERT INTO ExportedTables2(Table_Schema, Table_Name)
VALUES
('SalesLT', 'Address'),
('SalesLT', 'Customer'),
('SalesLT', 'Product')

-- Check on the result:

SELECT * FROM [dbo].[ExportedTables2]

It is also important to un-check First row only, otherwise we will only return a single table.

The query above accesses the Azure SQL Database that we opened a connection to and extracts the list of available tables. We can verify that it works by pressing Preview data.

Next, we need to establish what is effectively a for loop to run the same data-copying command across each of the tables we want. Again, we give our ForEach activity a more descriptive name, For Each Table, and then proceed to the Settings tab, where we Add dynamic content. We scroll down and find that our previous step, Table List, is an option, and it tells us it will take in the output from the table list. We need to modify it slightly by pointing the dynamic content at the value property, like so: @activity('Table List').output.value. Click Finish.
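To make the shapes concrete, here is a local Python simulation (not pipeline code) of what the Lookup output looks like and how the ForEach walks it; the column names match what INFORMATION_SCHEMA.TABLES returns:

```python
# Simulated output of the Table List Lookup activity (with First row only
# unchecked), i.e. roughly what @activity('Table List').output.value holds.
lookup_output = {
    "value": [
        {"TABLE_SCHEMA": "SalesLT", "TABLE_NAME": "Address", "TABLE_TYPE": "BASE TABLE"},
        {"TABLE_SCHEMA": "SalesLT", "TABLE_NAME": "Customer", "TABLE_TYPE": "BASE TABLE"},
    ]
}

# The ForEach activity iterates this list; inside the loop, @item() is one row,
# so @item().TABLE_SCHEMA and @item().TABLE_NAME feed each copy job's parameters.
copy_jobs = [(item["TABLE_SCHEMA"], item["TABLE_NAME"]) for item in lookup_output["value"]]
print(copy_jobs)
```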

Next, we proceed to the Activities tab, where we click the pencil to edit. Here, we drag on the Copy data activity from Move & transform. We first give it a more descriptive name, Export Tables to Parquet, and then move to the Source tab, where we select our Azure SQL Database. Here, we must set new dynamic content for our TableName and SchemaName parameters. Since we are working in a loop, we specify that the parameters for each copy job follow the current item: @item().TABLE_NAME and @item().TABLE_SCHEMA, which extract the necessary columns from the table list produced by the SQL query above.

Then, we select the Sink tab to choose where these SQL tables will be sent. Since we want this data written as Parquet files, we choose the integration dataset we made earlier, sql_aw_to_pq. Again, we manually specify the parameters with @item() expressions.

Lastly, we need to publish our changes and update our workspace so we can execute our pipeline. For debugging and to ensure we did everything correctly, we can return to the Integrate section in the sidebar, select our pipeline For Each Copy, and press Debug. It will run through the process, and we can see where the pipeline succeeded or failed. As we can see from the image below, we succeeded in copying all of the tables from the SQL database, converting the data to Parquet, and placing that data in our Data Lake.

To run this process on a schedule or to re-run it, we can select Add trigger and press Trigger now to run our pipeline again. To view the process, we can select the Monitor tab on the sidebar.

We can see the results of our work by going to our Data tab, selecting our Azure Data Lake Storage Gen2, selecting our file system demolake, and viewing our Parquet files.

Success! We have now populated an Azure Data Lake with data from an Azure SQL Database that we can now proceed to analyze.

4 Connect Azure Data Lake to Azure Databricks

Our next logical task is to use our data in analysis with Databricks for parallel computing. To begin this process, we want to create an Azure Databricks workspace through the Azure services homepage.

With our Databricks instance created, our next step is to create the security architecture that will allow Databricks to interact with our Data Lake.

Our first step is to create a key vault named dbkvdemo using Azure Key vaults. Next, we access Azure App registrations and create a new registration called db_app. Once it is completed, we want to copy two important items from the registration: the Application (client) ID and the Directory (tenant) ID.

Next, we need to create a Client secret inside our App registration for db_app. We do this by selecting Certificates & secrets in the left pane, clicking + New client secret, adding a description, db_secret, and setting an appropriate expiration. We then copy the value of db_secret, which is visible now but will no longer be retrievable after we leave this window: h_tVc.b2Z-Pq-alr~WU56mNm8V00Sez8l6.

Now, we need to return to our Key vaults where we select our key vault dbkvdemo and select Secrets on the left pane. We then select + Generate/Import, give it a name, dbappkey, and give it the value from the App registrations -> Certificates & secrets -> Client secrets -> Value (h_tVc.b2Z-Pq-alr~WU56mNm8V00Sez8l6).

Now, we need to create a Secret Scope in Databricks. We access it by opening our Databricks workspace and editing its URL, like so: https://adb-368901377989901.1.azuredatabricks.net/?o=368901377989901, appending this to the end: #secrets/createScope. We set our Scope Name to db_scope. Ensure that Manage Principal reads All Users, since Creator requires the premium tier. For the DNS Name, we get that information by opening our Key vault -> Properties -> Vault URI (https://dbkvdemo.vault.azure.net/). The Resource ID is found on the same page under Resource ID and starts with /subscriptions/.
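Since the createScope page is reached purely by appending a fragment to the workspace URL, a tiny helper illustrates the edit:

```python
def secret_scope_url(workspace_url: str) -> str:
    """Append the hidden createScope page fragment to a Databricks
    workspace URL (simple string concatenation, nothing more)."""
    return workspace_url + "#secrets/createScope"

print(secret_scope_url("https://adb-368901377989901.1.azuredatabricks.net/?o=368901377989901"))
```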

With our Secret Scope created, we can now create a new notebook and try accessing our Data Lake. To instantiate the access, we want to fill out the following code:

configs = {"fs.azure.account.auth.type": "OAuth",
          "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
          "fs.azure.account.oauth2.client.id": "<application-id>",
          "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>",key="<service-credential-key-name>"),
          "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token"}

# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
  source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
  mount_point = "/mnt/<mount-name>",
  extra_configs = configs)

Where we replace the placeholders as follows: <application-id> is the Application (client) ID of db_app; <scope-name> is db_scope; <service-credential-key-name> is dbappkey, the secret we stored in our key vault; <directory-id> is the Directory (tenant) ID from the App registration; <container-name> is our file system, demolake; <storage-account-name> is the storage account we created with our Synapse workspace; and <mount-name> is whatever name we want for the mount point.
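To double-check the substitutions, here is a small sketch (plain Python string formatting, using the demo's file system and a hypothetical storage account name) of the two URIs the config builds:

```python
def abfss_source(container: str, storage_account: str) -> str:
    """ABFS URI for the Data Lake file system we are mounting."""
    return f"abfss://{container}@{storage_account}.dfs.core.windows.net/"

def oauth_endpoint(directory_id: str) -> str:
    """Azure AD token endpoint for the app registration's tenant."""
    return f"https://login.microsoftonline.com/{directory_id}/oauth2/token"

# 'demostorageacct' is a hypothetical name; use your own storage account.
print(abfss_source("demolake", "demostorageacct"))
print(oauth_endpoint("<directory-id>"))
```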

Lastly, we need to grant some access to our Data Lake. To do so, navigate to Azure Storage accounts and select Access control (IAM). Next, we select Add role assignment and choose Storage Blob Data Contributor, because we want to read and write data through our Databricks application. We need to assign this privilege to a member, so we click + Select members and, in the search box, enter the App registration we made earlier in this process: db_app. We then review and assign the assignment. It may take a few minutes for the role assignment to propagate.

Now we can inject all of this data into our Databricks notebook like so:

We can see that we now have access to the Parquet files located in our Data Lake, and we can proceed with cleaning our data, saving separate copies in our BASE (not RAW) folder to maintain control.
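One lightweight way to honor the RAW/BASE split in a notebook is a small path helper (the zone names are this demo's convention, not a Databricks API):

```python
def to_base_path(raw_path: str) -> str:
    """Map a RAW landing path to its BASE (cleaned) counterpart, so cleaned
    copies never overwrite the raw originals."""
    return raw_path.replace("/raw/", "/base/", 1)

print(to_base_path("/mnt/demolake/raw/SalesLT/Customer"))
# /mnt/demolake/base/SalesLT/Customer
```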

5 Set up Azure DevOps in Synapse for Git control

Lastly, we want to set up Azure DevOps inside Synapse for two reasons: first, to establish version control, and second, to back up our work.

Our first step is to access Azure DevOps (https://dev.azure.com/), which is hosted outside of the usual Azure services cloud portal. Inside DevOps, we want to create a new project called synapse_demo and initialize a repo.

Next, in Azure Synapse, we want to select the Manage tab on the left and select Git configuration and connect to our recently created repo on DevOps. For the collaboration_branch we should create a new branch called dev (unlike the picture below) to separate our work from main.

We now have version control working in Synapse. Next, we hit Publish at the top to send our work to DevOps, and we notice that Synapse has pushed our pipelines and integration dataset connections to the dev branch. Our work is now backed up.